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Abstract:  This  paper  presents  the  results  of  the  final  year  of  a  development  effort  to 
provide  a  scalable,  portable,  parallel  scene  generation  tool  that  provides  the  capability  to 
rapidly  generate  scenes  of  radiating  and  scattering  structures  in  realistically  complex 
electromagnetic  environments.  The  benefit  of  such  a  tool  is  that  it  will  provide  users  with 
the  capability  to  solve  large  problems  that  cannot  be  currently  solved  with  existing 
sequential  electromagnetic  modeling  tools.  This  tool  supports  a  broad  range  of  users 
including  researchers,  algorithm  developers,  analysts,  and  system  developers.  This  paper 
will  present  the  parallelization  process  and  will  show  the  final  results  of  the  project. 

The  project  presented  here  is  the  parallelization  of  WIPL-D,  an  electromagnetic  modeling 
tool,  which  picks  up  in  time  from  where  [1]  left  off.  Through  parallelization,  the  well 
known  and  commercially  available  tool  became  faster  and  now  possesses  increased 
capabilities.  This  paper  will  walk  you  through  the  parallelization  process,  providing  the 
strategies  used  and  the  results  received. 
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1.  Background 

The  parallel  electromagnetic  modeling  tool  (WIPL-DP)  that  has  been  developed  under 
this  effort  is  leveraged  from  a  proven  commercially  successful  sequential  electromagnetic 
modeling  and  scene  generation  tool  called  WIPL-D  (Wires,  Plates  and  Dielectrics)  which 
is  described  in  [1].  This  WIPL-DP  effort  was  funded  under  the  Common  High 
Performance  Computing  Software  Support  Initiative  (CHSSI)  as  described  in  [1]. 

This  was  a  three  year  development  effort  with  three  major  milestone  reviews.  We  have 
completed  and  successfully  passed  all  of  the  evaluations.  The  three  major  milestone 
reviews  were  used  to  determine  if  the  project  was  meeting  the  guidelines  set  forth  by  the 
CHSSI  program.  The  first  two  years  are  discussed  in  detail  in  [1].  This  paper  is  a 
continuation  and  picks  up  at  the  start  of  the  third  and  final  year.  The  results  from  all  work 
completed  for  the  final  milestone  Beta  review  will  be  presented  here. 

2.  WIPL-DP  Development  -  Final  Year 

The  third  and  final  year  of  the  WIPL-DP  development  effort  completed  in  2004.  The 
goal  for  the  last  year  was  to  further  increase  the  performance  of  WIPL-DP  by  decreasing 
run  time  and  adding  additional  functionality.  This  was  accomplished  by  parallelizing  the 
impedance  matrix  generation  and  solution,  and  the  far  field  calculations.  This 
parallelization  was  in  addition  to  the  frequency  domain  parallelization  completed  in  the 
previous  year,  described  in  [1].  The  results  were  presented  at  the  Beta  Test  milestone 
which  occurred  in  June  2004.  The  three  specific  critical  test  parameters  that  had  to  be  met 
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were  as  follows.  The  scaled  speedup  was  to  exceed  80%  (25%  as  a  minimum)  of  the 
single  processor  C  version  on  64  nodes  (32  nodes  as  a  minimum).  The  code  was  to  be  run 
on  three  High  Performance  Computing  platforms  (minimum  of  2)  producing  valid  results. 
The  single  processor  version  of  the  code  was  to  produce  an  optimal  accuracy  of  2%  (3% 
minimum)  when  compared  to  the  commercial  version  of  WIPL-D. 

The  parallelization  of  the  impedance  matrix  generation  provided  increased  functionality, 
as  well  as  improved  speedup.  By  distributing  the  matrix,  a  greater  number  of  unknowns 
were  able  to  be  used,  thus  allowing  for  larger  simulations  to  be  solved.  The  impedance 
matrix  is  made  up  of  complex  values  and  has  the  dimensions  of  number  of  unknowns  by 
number  of  unknowns.  Each  processor  performing  the  generation  holds  a  fraction  of  the 
complete  matrix.  The  number  of  processors  used  was  determined  by  dividing  the  number 
of  unknowns  for  the  project  by  the  number  of  unknowns  that  can  be  held  in  a  single 
processor’s  memory.  The  matrix  is  constructed  by  looping  over  the  number  of  elements 
(number  of  wires  and  plates)  and  performing  impedance  calculations.  There  is  an  outer 
and  inner  loop  that  cycle  over  the  number  of  elements.  The  outer  loop  iterations  are 
independent  of  each  other,  except  for  accumulations  at  the  end  of  the  inner  loop.  Each 
processor  entering  the  loop  was  assigned  iterations  to  be  executed  in  parallel.  Results 
prior  to  the  accumulation  were  saved  off,  and  the  actual  accumulations  were  performed 
afterward.  All  of  the  calculations  in  the  matrix  construction  are  distributed  over  the 
number  of  processors  performing  the  generation,  providing  processing  speed  benefits. 

The  solving  of  the  impedance  matrix  was  the  major  bottleneck  of  the  program, 
consuming  around  75%  of  the  processing  time.  Through  parallelization  the  solution  took 
significantly  less  processing  time.  Solving  the  system  of  equations  was  originally  done 
using  LU  Decomposition.  In  the  commercial  version  matrix  solution,  the  order  of 
execution  creates  dependencies  between  each  iteration  of  the  main  loop.  In  order  to 
parallelize  the  solution,  the  section  of  code  that  performs  the  current  matrix  solution  was 
replaced  with  a  parallel  implementation.  The  impedance  matrix  is  placed  in  a  form  that  is 
needed  by  a  parallel  implementation  through  a  distribution  equation  and  then  the  solution 
can  be  transformed  back  to  the  form  required  for  WIPL-D.  The  parallel  complex  LU 
Decomposition  function  that  was  implemented  is  contained  in  the  Scalapack  libraries.  It 
handles  the  decomposition  by  distributing  blocks  of  data  among  a  rectangular  grid  of 
processors  in  a  2-Dimensional  block-cyclic  method.  This  distribution  of  data  provides  for 
good  load  balancing  and  cuts  down  on  the  amount  of  broadcasting.  This  solver  was 
chosen  since  Scalapack  is  highly  portable  and  free,  although  it  is  not  the  most  efficient 
parallel  matrix  solver  out  there.  The  code  was  implemented  such  that  different  parallel 
matrix  solving  functions  may  be  substituted  in  to  perform  the  matrix  solution. 

The  parallelization  of  the  far  field  calculations  allowed  for  scaled  speedup.  This  portion 
of  the  code  loops  over  the  phi  and  theta  directions  present  for  the  problem.  The  outer  loop 
of  the  far  field  calculations  contains  independent  iterations.  The  total  number  of 
directions  was  divided  among  the  available  processors  to  provide  for  scaled  speedup. 
There  was  output  file  printing  contained  in  the  loop  which  had  to  be  pulled  out  and  then 
merged  at  the  end  in  order  to  preserve  the  WIPL-D  file  structure. 

For  the  Beta  Test,  the  High  Performance  Computers  that  were  selected  to  meet  the 
requirements  set  forth  by  CHS  SI  were  Huinalu  and  Tempest  at  the  Maui  High 
Performance  Computing  Center  and  the  Compaq  SC/45  at  the  Aeronautical  Systems 
Center  Major  Shared  Resource  Center  (ASC  MSRC).  The  Huinalu  and  Tempest 
machines  are  described  in  detail  in  [1].  The  Compaq  SC/45  machine  contained  128 


nodes,  with  4  EV6.8  Alpha  1GHz  processors  with  1  GB  of  memory  per  node.  A  modified 
version  of  the  test  case  used  for  the  Alpha  Test  in  [1]  was  used  for  the  Beta  Test.  Changes 
were  made  to  make  the  model  more  realistic,  which  can  be  seen  in  Figure  1. 


Figure  1.  Cell  Phone  Alongside  a  Dielectric  Ovoid 

This  test  case  contained  5,662  unknowns  and  was  run  over  the  frequency  range  of  0.9  to 
2.4  GHz.  The  performance  tests  consisted  of  running  the  parallel  code  on  1  to  64 
processors,  with  the  number  of  frequencies  for  each  case  equal  to  the  number  of 
processors  used.  This  allowed  us  to  increase  the  problem  size/time  to  demonstrate  scaled 
speedup.  The  scaled  speedup  was  calculated  by  taking  the  time  for  the  code  to  run  on  a 
single  processor  and  comparing  it  to  the  time  it  took  on  N  processors  (T(1)/T(N)).  The 
accuracy  of  the  code  was  determined  by  comparing  the  output  files  generated  by  the 
parallel  code  to  the  output  files  created  by  running  the  original  WIPL-D  PC  version. 
Originally,  three  tests  were  planned  for  each  case.  The  frequency  and  matrix 
parallelizations  were  to  be  tested  separately  and  in  conjunction  with  each  other  as  a 
hybrid.  For  the  formal  Beta  Review  only  the  hybrid  case  was  requested  for  determining  if 
the  code  met  the  requirements.  The  speedup/timing  results  for  the  three  target  machines 
can  be  seen  in  Figures  2-4.  They  contain  all  three  separate  cases  described  above. 
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Figure  2.  Scaled  Speedup  Results  for  Tests  Run  on  the  Compaq  SC/45 


Huin^lu  -  Seated  Speed-Up  (Vs.  Scalapack  Solver) 
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Figure  3.  Scaled  Speedup  Results  for  Tests 
Run  on  the  Huinalu  Finux  Cluster 


Figure  4.  Scaled  Speedup  Results  for  Tests 
Run  on  the  Tempest  IBM  SP3 


The  above  figures  show  that  the  Hybrid  case  exceeded  the  optimum  requirement  of  80% 
scaled  speedup  for  all  but  the  64  processor/frequency  case,  where  it  was  well  above  the 
minimum  requirement.  The  frequency  parallelization  alone  performed  exceptionally  well 
for  all  cases.  The  matrix  parallelization  scaled  speedup  fell  off  as  the  number  of 
processors/frequencies  increased  due  to  the  matrix  size  we  were  using  being  relatively 
small.  When  the  matrix  was  divided  by  more  processors,  the  size  of  the  local  matrix 


partitions  held  on  each  processor  became  so  small  that  the  communication  overhead’s 
impact  was  more  significant.  Another  factor  was  that  the  parallel  matrix  solution  routine 
we  used  was  not  highly  efficient  for  this  case.  As  mentioned  above,  the  Scalapack 
libraries  were  chosen  based  on  project  requirements,  not  performance. 

The  accuracy  results  were  determined  by  taking  the  percent  error  between  the  parallel 
version’s  output  files  and  the  commercial  PC  version.  A  summary  of  these  results  can  be 
seen  in  Table  1.  The  worst  case  percent  error  is  listed  for  each  file.  The  overall  worst  case 
error  of  0.082%  fell  well  below  the  minimum  requirement  of  2%. 


|  Accuracv  Summary  | 

Output  File 

Max  Error  (%) 

Frequency  Case 

HPCC  System 

a  (11 

0.0027746 

16  (F,  M,  H) 

Tempest,  Huinalu 

till 

0.0048228 

16  (F,  M,  H) 

Tempest,  Huinalu,  Compaq 

nf  1 

0.0221264 

64  (F,  M,  H) 

Compaq 

ral 

0.0823788 

4  (F,  M,  H) 

Tempest,  Huinalu,  Compaq 

Table  1.  Percent  Error  for  Parallel  WIPL-D 


The  above  listed  results  met  the  Critical  Test  Parameters  set  forth  by  the  CHS  SI  Program 
for  the  Beta  Review.  The  project  was  named  a  success  by  the  CHSSI  office. 

3.  Future  Developments 

With  the  CHSSI  Project  being  completed,  some  follow  on  projects  are  being  looked  into. 
The  applications  team  would  have  liked  to  see  the  solvable  number  of  unknowns  meet 
and  exceed  100,000  for  the  tool  to  be  highly  beneficial.  At  this  point  the  number  of 
solvable  unknowns  is  around  30,000.  Late  in  the  project  a  problem  was  identified  with 
high  numbers  of  unknowns  that  unfortunately  was  not  able  to  be  resolved  before  the  end 
of  the  project  due  to  time  and  money  constraints.  Increasing  the  number  of  unknowns 
was  not  a  requirement  under  this  project,  but  rather  an  outcome  that  was  desired  from  it. 
Other  desired  improvements  to  the  code  include  replacing  the  Scalapack  libraries  with  a 
more  efficient  parallel  solver  and  creating  a  graphical  user  interface  WIPL-DP.  Also, 
since  the  version  of  WIPL-D  that  we  were  provided  with  is  several  versions  older 
compared  to  the  latest,  we  would  like  to  incorporate  any  changes  that  have  been  made. 

4.  Conclusion 

Parallel  WIPL-D  provides  users  with  an  electromagnetic  simulation  modeling  tool  that 
can  solve  larger  problems  quickly  and  with  increased  capability.  The  code  has  passed  all 
of  the  requirements  set  forth  by  the  CHSSI  Program,  far  exceeding  the  optimum 
requirements  in  all  but  one  case.  The  ability  to  solve  a  larger  number  of  unknowns  and 
decreased  processing  time  provide  an  improved  tool,  but  in  order  for  Tri-Service 
applications  to  highly  benefit  from  this  software  the  solvable  number  of  unknowns  must 
be  raised  even  further.  This  continued  work  will  need  be  performed  under  follow  on 
projects  since  the  CHSSI  Program  has  ended. 
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